This is an R Markdown document. It is my first attempt at the Week 2 Live assignment constructed and commented solely in R Markdown. I intend to restate each question within the assignment and provide the code within the document. I am also going to practice my GitHub skills by creating a repo for this assignment. My GitHub repo is CTG-SMU-DataHub. I’ll use knitr to knit the questions together into one coherent set.
First, I will load the PlayersBBall.csv into RStudio. Taking a peek at the data set in MS Excel, there are 4550 observations (rows) and 8 columns, and the data set contains headings. It would seem that a bar chart with the positions on the x-axis and a simple count makes sense. After reading in the data set and assigning it to a variable, the ggplot() function within the ggplot2 package will bring this scheme to life.
p <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/PlayersBBall.csv", header = TRUE)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
p %>% ggplot(mapping = aes(x = position)) + geom_bar(stat = "count")
This provides the bar chart, as expected, but the count can only be approximated from the axis. ggplotly() would provide an interactive means to bring specificity to this answer.
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(ggthemes)
z = p %>% ggplot(mapping = aes(x = position)) + geom_bar(stat = "count") + theme_economist() + labs(x = "position", y = "frequency", title = "Players in each position")
ggplotly(z)
This is a much more satisfying interactive chart that provides an exact count of each position.
There are likely numerous ways to investigate the distribution of the weights of centers (C) and forwards (F), but the most straightforward is the boxplot, as shown below.
p %>% ggplot(aes(x = position, y = weight)) + geom_boxplot() + theme_economist() + labs(x = "position", y = "weight in pounds", title = "Distribution of weight by position")
## Warning: Removed 6 rows containing non-finite values (`stat_boxplot()`).
This answers the question, but it leaves me wanting a better visual solution. A solution that:
1. Isolates centers and forwards
2. Quantifies the weight distribution for each position
I will devote a half hour to seeking a solution.
First, to isolate the centers and forwards from the rest of the positions. Then, take the weight of the centers and then the weight of the forwards.
C = p[p$position == "C",]
Forw = p[p$position == "F",]
WC = mean(C$weight, na.rm = TRUE)  # mean() is simpler than .colMeans() and handles missing weights
WF = mean(Forw$weight, na.rm = TRUE)
paste("The average weight of the centers is,", WC,"pounds, and the average weight of the forwards is,", WF,"pounds.")
## [1] "The average weight of the centers is, 245.5 pounds, and the average weight of the forwards is, 221 pounds."
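The same averages can also be computed in one pass with dplyr's group_by()/summarise(). A sketch on a toy stand-in for the players data (the values below are illustrative, not from PlayersBBall.csv):

```r
library(dplyr)

# Toy stand-in for the players data (illustrative values only)
p <- data.frame(
  position = c("C", "C", "F", "F", "G"),
  weight   = c(250, 240, 220, NA, 190)
)

# Keep only centers and forwards, then average weight within each position;
# na.rm = TRUE drops players with missing weights.
p %>%
  filter(position %in% c("C", "F")) %>%
  group_by(position) %>%
  summarise(mean_weight = mean(weight, na.rm = TRUE))
# mean weights: C = 245, F = 220
```

On the real data set this replaces the two subsets and two mean() calls with a single pipeline, and it extends to all positions by simply dropping the filter() step.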
I recognize that although numerically pleasing, it doesn’t specifically address the issue of “visualizing” the difference in weights, but the distribution of weights is shown visually via the box plot. I’m hoping that the combined solution will satisfy the client.
This should be fairly straightforward (said the man headed to the gallows). I will repeat the Bullet 2 process. First, I’ll refresh the dataset. The operation is to separate feet and inches and convert feet to inches so we have uniform units. The height column is currently a character. The task is either to split feet and inches into two separate columns, drop the “-” in between, convert each to numeric, and recreate the height column in units of inches; or the same operation, but converting to numeric first. My guess is that the first method is the way to go, and that’s what I’m going to attack.
After spending way too much time trying to wrangle the data via RStudio, I took another approach which was to modify the data in Excel to inches. This is probably not the intended solution, but it’s hard to argue with results. And, if this were a client, I don’t think they’d care how you arrived at the solution. The DS6306 professor, on the other hand, may have something to say about this. But, I’ve been transparent with my methods.
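For the record, the feet-and-inches split I gave up on can be done in R with tidyr's separate(). A sketch on toy data, assuming the original height column holds strings like "6-10":

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the original character height column (illustrative values)
p <- data.frame(height = c("6-10", "5-11", "7-0"))

p <- p %>%
  separate(height, into = c("feet", "inches"), sep = "-", convert = TRUE) %>%
  mutate(height = feet * 12 + inches)  # total height in inches

p$height
# [1] 82 71 84
```

convert = TRUE turns the split pieces into integers, so the mutate() arithmetic works without an explicit as.numeric() step.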
After importing the new dataset, I will address the question posed in bullet 3.
j <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/DS6306_week2/PlayersBBall.csv", header = TRUE)
j %>% ggplot(aes(x = position, y = height)) + geom_boxplot() + theme_economist() + labs(x = "position", y = "height in inches", title = "Player height by position")
We can see from the boxplot that the centers’ heights are generally greater than the forwards’ heights.
Once again, we utilize the boxplot to visually see that each of the position’s heights are differently distributed.
j <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/DS6306_week2/PlayersBBall.csv", header = TRUE)
j %>% ggplot(aes(x = position, y = height, color = position)) + geom_boxplot() + theme_economist() + xlab("Position") + ylab("height in inches") + labs(title = "Distribution of player height by position")
For this question, we need to plot the years versus the heights as a scatter plot and draw a trend line.
Perhaps there’s a more straightforward means to evaluate this claim. If we perform a linear regression of the heights on the years, it should lead us to an equation that answers this question. As there are two dates to select from, I believe it’s best to pick one or the other, as a player’s height is not likely to have changed much, if at all, during their career. To take the possibility of any change out, I select the ending year.
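The scatter-plot-with-trend-line idea mentioned above can be sketched like this (toy data standing in for the real j; geom_smooth(method = "lm") draws the fitted line with a confidence band):

```r
library(ggplot2)

# Toy stand-in for the players data (illustrative values only)
j <- data.frame(year_end = c(1950, 1970, 1990, 2010),
                height   = c(76, 77, 78, 79))

g <- ggplot(j, aes(x = year_end, y = height)) +
  geom_point(alpha = 0.3) +                 # one point per player
  geom_smooth(method = "lm", se = TRUE) +   # linear trend with CI band
  labs(x = "Final season", y = "Height (inches)",
       title = "Player height over time")
g
```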
summary(lm(height~year_end, j))
##
## Call:
## lm(formula = height ~ year_end, data = j)
##
## Residuals:
## Min 1Q Median 3Q Max
## -77.437 -2.483 0.134 2.394 12.609
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -33.550543 4.906767 -6.838 9.12e-12 ***
## year_end 0.056111 0.002466 22.750 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.639 on 4548 degrees of freedom
## Multiple R-squared: 0.1022, Adjusted R-squared: 0.102
## F-statistic: 517.5 on 1 and 4548 DF, p-value: < 2.2e-16
Height = Year*(0.056111) - 33.550543
For 2000: Height = (2000*0.056111) - 33.550543 = 78.67 inches
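As a sanity check, the prediction can be computed directly from the coefficient estimates printed above:

```r
# Coefficient estimates from the regression summary above
intercept <- -33.550543
slope     <- 0.056111

# Predicted average height for a player whose career ended in 2000
slope * 2000 + intercept
# [1] 78.67146
```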
Let’s check the correlation:
library(dplyr)
j <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/DS6306_week2/PlayersBBall.csv", header = TRUE)
cor(j$year_end, j$height)## [1] 0.3196402
With a correlation of about 0.32, this confirms a positive, though moderate, association between height and ending year in the data provided.
My thought as I dig into this is, “What could go wrong?” As it turns out, my first attempt failed: I capitalized the z aesthetic, referenced a nonexistent year column, and left the ~ off position.
p <- plot_ly(j, x = ~height, y = ~weight, Z = ~year, color = position) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Height"), yaxis = list(title = "Weight"), zaxis = list(title = "Year")))
library(plotly)
j <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/DS6306_week2/PlayersBBall.csv", header = TRUE)
p <- plot_ly(j, x = ~height, y = ~weight, z = ~year_end, color = ~position) %>% add_markers() %>% layout(scene = list(xaxis = list(title = "Height"),yaxis = list(title = "Weight"), zaxis = list(title = "Year")))
p
## Warning: Ignoring 6 observations
I found 4 of the different visualizations particularly pleasing:
1. Marginal Histogram/Boxplot
2. Treemap
3. Time Series Heat Map
4. Spatial
I’ve seen many of the maps on the website, and the treemap is LOADED with information; I’ve seen it on TV and in magazines, rich with content. Most people have seen the Spatial maps, and they are useful when working with GIS or other terrestrial landscape features. I had not seen the Time Series Heat Map, but I truly liked it, as it is very illuminating for processes that occur over time and for gleaning recurring patterns; sales come to mind, as does stock and securities trading. The one I believe has the most appeal for the current dataset is the Marginal Histogram/Boxplot. It adds two dimensions to otherwise static data. Let me see what I can do with it.
#```{r}
#reload BBall CSV
j <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/DS6306_week2/PlayersBBall.csv", header = TRUE)
# load packages
library(ggplot2)
#install.packages('ggExtra')
library(ggExtra)
theme_set(theme_economist()) # pre-set the economist theme
BBall_select <- j[j$year_start >= 2011 & j$height >= 87, ]
g <- ggplot(j, aes(position, height)) + geom_count()
#g <- g + geom_smooth(method="lm", se=F)
#ggMarginal(g, type = "histogram", fill="transparent")
ggMarginal(g, type = "boxplot", fill="transparent")
#ggMarginal(g, type = "density", fill="transparent")
#```
Apparently, not much. I need to attack this again from another angle. Perhaps the Treemap for positions, then count the members of each, and the year? This dataset isn’t perfect for this feature. I’ve literally run out of time. I’m going to take a stab at the very last question with a different dataset, EducationIncome.csv. Epic fail.
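For what it’s worth, ggMarginal() does work when handed a plain scatter plot of two continuous variables. A minimal sketch on toy height/weight pairs (illustrative values, not the real data):

```r
library(ggplot2)
library(ggExtra)

# Toy height/weight pairs (illustrative values only)
d <- data.frame(height = c(72, 74, 76, 78, 80, 82),
                weight = c(180, 195, 205, 220, 235, 250))

g <- ggplot(d, aes(x = height, y = weight)) + geom_point()

# Attach marginal boxplots along both axes
ggMarginal(g, type = "boxplot")
```

The catch with the attempt above is that geom_count() maps a discrete variable to x, while ggMarginal() expects two continuous axes to build the margins from.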
#```{r}
library(ggplot2)
#install.packages('ggplotify')
library(treemapify)
library(ggplotify)
proglangs <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/DS6306_week2/PlayersBBall.csv", header = TRUE)
treeMapCoordinates <- treemapify(proglangs, area = "year_end", fill = "weight")
treeMapPlot <- ggplotify(treeMapCoordinates) + scale_x_continuous(expand = c(0, 0)) + scale_y_continuous(expand = c(0, 0)) + scale_fill_brewer(palette = "Dark2")
print(treeMapPlot)
#```
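Current versions of treemapify draw the plot directly with geom_treemap() rather than the treemapify()/ggplotify() pair attempted above. A minimal sketch on made-up per-position counts (illustrative values only):

```r
library(ggplot2)
library(treemapify)

# Hypothetical per-position counts (illustrative values only)
counts <- data.frame(position = c("C", "F", "G"),
                     n        = c(900, 1500, 2100))

g <- ggplot(counts, aes(area = n, fill = position, label = position)) +
  geom_treemap() +                                      # tile area ~ count
  geom_treemap_text(colour = "white", place = "centre") # label each tile
g
```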
First, I’ll load in the dataset and take a look at what we’ve got.
Ed <- read.csv("/Users/tgarn/OneDrive/Desktop/SMU - MS Data Science/Courses/GitHub/DS_6306_weekly_assignments/Week_2/DS6306_week2/Education_Income.csv", header = TRUE)
head(Ed)
##   Subject  Educ Income2005
## 1 2 12 5500
## 2 6 16 65000
## 3 7 12 19000
## 4 8 13-15 36000
## 5 9 13-15 65000
## 6 13 16 8000
#Let's explore how many unique values we have in the Educ column
unique(Ed$Educ, incomparables = FALSE)
## [1] "12"    "16"    "13-15" ">16"   "<12"
Seeing that there are 5 unique timeframes in years, I believe that the most straightforward method, given the remaining time, is to use these character values to select rising years of education and compare them against income. I think we can discard the value “<12” and just go with “12”, comparing it to “13-15”, “16”, and “>16”.
data_12 <- Ed[Ed$Educ == "12",]
#head(data_12)
E12 <- data.frame(data_12$Income2005)
#head(E12)
summary(E12)
##  data_12.Income2005
## Min. : 300
## 1st Qu.: 19975
## Median : 31000
## Mean : 36865
## 3rd Qu.: 48000
## Max. :410008
data_13_15 <- Ed[Ed$Educ == "13-15",]
#head(data_13_15)
E_13_15 <- data.frame(data_13_15$Income2005)
#head(E_13_15)
summary(E_13_15)
##  data_13_15.Income2005
## Min. : 429
## 1st Qu.: 24000
## Median : 38000
## Mean : 44876
## 3rd Qu.: 58000
## Max. :257286
data_16 <- Ed[Ed$Educ == "16",]
#head(data_16)
E_16 <- data.frame(data_16$Income2005)
#head(E_16)
summary(E_16)
##  data_16.Income2005
## Min. : 200
## 1st Qu.: 32000
## Median : 56500
## Mean : 69997
## 3rd Qu.: 89000
## Max. :519340
data_over_16 <- Ed[Ed$Educ == ">16",]
#head(data_over_16)
E_over_16 <- data.frame(data_over_16$Income2005)
#head(E_over_16)
summary(E_over_16)
##  data_over_16.Income2005
## Min. : 63
## 1st Qu.: 40000
## Median : 60500
## Mean : 76856
## 3rd Qu.: 96000
## Max. :703637
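The four subset-and-summary steps above collapse into one grouped summary with dplyr. A sketch on a toy stand-in for the Education_Income data (illustrative values only):

```r
library(dplyr)

# Toy stand-in for the Education_Income data (illustrative values only)
Ed <- data.frame(
  Educ       = c("12", "12", "13-15", "16", ">16", "<12"),
  Income2005 = c(30000, 34000, 40000, 55000, 62000, 20000)
)

# Drop "<12", then summarise income within each education level
Ed %>%
  filter(Educ %in% c("12", "13-15", "16", ">16")) %>%
  group_by(Educ) %>%
  summarise(median_income = median(Income2005),
            mean_income   = mean(Income2005))
```

On the real data set this yields the same medians as the four separate summary() calls, in a single table that is easy to sort or plot.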
The rising medians strongly suggest that income increases with education.
I’m not happy with my work on this Live Set 2 at all. It’s beneath my quality standards. I spent countless hours spinning my wheels. I’ve got to do better at finding solutions. I took shortcuts by modifying the dataset early on and didn’t have time to come back and change it, even though I figured out how to do so. I’m embarrassed to turn this in. I clearly have a lot of work to do in order to be more efficient with my time. My apologies to the professor for this subpar work.
sessionInfo()
## R version 4.2.2 (2022-10-31 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19044)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.utf8
## [2] LC_CTYPE=English_United States.utf8
## [3] LC_MONETARY=English_United States.utf8
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.utf8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggthemes_4.2.4 plotly_4.10.1 dplyr_1.0.10 ggplot2_3.4.0
##
## loaded via a namespace (and not attached):
## [1] tidyselect_1.2.0 xfun_0.36 bslib_0.4.2 purrr_1.0.0
## [5] lattice_0.20-45 splines_4.2.2 colorspace_2.0-3 vctrs_0.5.1
## [9] generics_0.1.3 htmltools_0.5.4 viridisLite_0.4.1 yaml_2.3.6
## [13] mgcv_1.8-41 utf8_1.2.2 rlang_1.0.6 jquerylib_0.1.4
## [17] pillar_1.8.1 glue_1.6.2 withr_2.5.0 DBI_1.1.3
## [21] RColorBrewer_1.1-3 lifecycle_1.0.3 stringr_1.5.0 munsell_0.5.0
## [25] gtable_0.3.1 htmlwidgets_1.6.0 evaluate_0.19 labeling_0.4.2
## [29] knitr_1.41 fastmap_1.1.0 crosstalk_1.2.0 fansi_1.0.3
## [33] highr_0.10 scales_1.2.1 cachem_1.0.6 jsonlite_1.8.4
## [37] farver_2.1.1 digest_0.6.31 stringi_1.7.8 grid_4.2.2
## [41] cli_3.5.0 tools_4.2.2 magrittr_2.0.3 sass_0.4.4
## [45] lazyeval_0.2.2 tibble_3.1.8 tidyr_1.2.1 pkgconfig_2.0.3
## [49] Matrix_1.5-1 ellipsis_0.3.2 data.table_1.14.6 assertthat_0.2.1
## [53] rmarkdown_2.19 httr_1.4.4 rstudioapi_0.14 R6_2.5.1
## [57] nlme_3.1-160 compiler_4.2.2